
The stock market has consistently proven to be a good place to invest in and save for the future. There are a lot of compelling reasons to invest in stocks. It can help in fighting inflation, create wealth, and also provides some tax benefits. Good steady returns on investments over a long period of time can also grow a lot more than seems possible. Also, thanks to the power of compound interest, the earlier one starts investing, the larger the corpus one can have for retirement. Overall, investing in stocks can help meet life's financial aspirations.
It is important to maintain a diversified portfolio when investing in stocks in order to maximise earnings under any market condition. Having a diversified portfolio tends to yield higher returns and face lower risk by tempering potential losses when the market is down. It is often easy to get lost in a sea of financial metrics to analyze while determining the worth of a stock, and doing the same for a multitude of stocks to identify the right picks for an individual can be a tedious task. By doing a cluster analysis, one can identify stocks that exhibit similar characteristics and ones which exhibit minimum correlation. This will help investors better analyze stocks across different market segments and help protect against risks that could make the portfolio vulnerable to losses.
Trade&Ahead is a financial consultancy firm who provide their customers with personalized investment strategies. They have hired you as a Data Scientist and provided you with data comprising stock price and some financial indicators for a few companies listed under the New York Stock Exchange. They have assigned you the tasks of analyzing the data, grouping the stocks based on the attributes provided, and sharing insights about the characteristics of each group.
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 yellowbrick==1.5 -q --user
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.2 yellowbrick==1.5 -q --user
# !pip install --upgrade -q jinja2
# To suppress the warnings
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# to scale the data using z-score
from sklearn.preprocessing import StandardScaler
# to compute distances
from scipy.spatial.distance import cdist
# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import KElbowVisualizer, SilhouetteVisualizer
# to compute distances
from scipy.spatial.distance import pdist
# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet
# to perform PCA
from sklearn.decomposition import PCA
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
# mount your google drive
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
# loading the dataset
path = "/content/drive/MyDrive/UTAS/8Unsupervised_Learning/Project/stock_data.csv"
data = pd.read_csv(path)
# shape of the dataset
data.shape
(340, 15)
# look at random sample of 10 rows
data.sample(n=10, random_state=1)
| Ticker Symbol | Security | GICS Sector | GICS Sub Industry | Current Price | Price Change | Volatility | ROE | Cash Ratio | Net Cash Flow | Net Income | Earnings Per Share | Estimated Shares Outstanding | P/E Ratio | P/B Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 102 | DVN | Devon Energy Corp. | Energy | Oil & Gas Exploration & Production | 32.000000 | -15.478079 | 2.923698 | 205 | 70 | 830000000 | -14454000000 | -35.55 | 4.065823e+08 | 93.089287 | 1.785616 |
| 125 | FB | Information Technology | Internet Software & Services | 104.660004 | 16.224320 | 1.320606 | 8 | 958 | 592000000 | 3669000000 | 1.31 | 2.800763e+09 | 79.893133 | 5.884467 | |
| 11 | AIV | Apartment Investment & Mgmt | Real Estate | REITs | 40.029999 | 7.578608 | 1.163334 | 15 | 47 | 21818000 | 248710000 | 1.52 | 1.636250e+08 | 26.335526 | -1.269332 |
| 248 | PG | Procter & Gamble | Consumer Staples | Personal Products | 79.410004 | 10.660538 | 0.806056 | 17 | 129 | 160383000 | 636056000 | 3.28 | 4.913916e+08 | 24.070121 | -2.256747 |
| 238 | OXY | Occidental Petroleum | Energy | Oil & Gas Exploration & Production | 67.610001 | 0.865287 | 1.589520 | 32 | 64 | -588000000 | -7829000000 | -10.23 | 7.652981e+08 | 93.089287 | 3.345102 |
| 336 | YUM | Yum! Brands Inc | Consumer Discretionary | Restaurants | 52.516175 | -8.698917 | 1.478877 | 142 | 27 | 159000000 | 1293000000 | 2.97 | 4.353535e+08 | 17.682214 | -3.838260 |
| 112 | EQT | EQT Corporation | Energy | Oil & Gas Exploration & Production | 52.130001 | -21.253771 | 2.364883 | 2 | 201 | 523803000 | 85171000 | 0.56 | 1.520911e+08 | 93.089287 | 9.567952 |
| 147 | HAL | Halliburton Co. | Energy | Oil & Gas Equipment & Services | 34.040001 | -5.101751 | 1.966062 | 4 | 189 | 7786000000 | -671000000 | -0.79 | 8.493671e+08 | 93.089287 | 17.345857 |
| 89 | DFS | Discover Financial Services | Financials | Consumer Finance | 53.619999 | 3.653584 | 1.159897 | 20 | 99 | 2288000000 | 2297000000 | 5.14 | 4.468872e+08 | 10.431906 | -0.375934 |
| 173 | IVZ | Invesco Ltd. | Financials | Asset Management & Custody Banks | 33.480000 | 7.067477 | 1.580839 | 12 | 67 | 412000000 | 968100000 | 2.26 | 4.283628e+08 | 14.814159 | 4.218620 |
# data types for original dataset
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 340 entries, 0 to 339 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 Ticker Symbol 340 non-null object 1 Security 340 non-null object 2 GICS Sector 340 non-null object 3 GICS Sub Industry 340 non-null object 4 Current Price 340 non-null float64 5 Price Change 340 non-null float64 6 Volatility 340 non-null float64 7 ROE 340 non-null int64 8 Cash Ratio 340 non-null int64 9 Net Cash Flow 340 non-null int64 10 Net Income 340 non-null int64 11 Earnings Per Share 340 non-null float64 12 Estimated Shares Outstanding 340 non-null float64 13 P/E Ratio 340 non-null float64 14 P/B Ratio 340 non-null float64 dtypes: float64(7), int64(4), object(4) memory usage: 40.0+ KB
# create a copy of the original data to avoid any changes to original dataset
df=data.copy()
# rename the columns by replacing spaces with underscore
df.columns=[c.replace(" ", "_") for c in df.columns]
# statistical summary for original dataset
df.describe(include='all').T
| count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Ticker_Symbol | 340 | 340 | AAL | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Security | 340 | 340 | American Airlines Group | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS_Sector | 340 | 11 | Industrials | 53 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| GICS_Sub_Industry | 340 | 104 | Oil & Gas Exploration & Production | 16 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| Current_Price | 340.0 | NaN | NaN | NaN | 80.862345 | 98.055086 | 4.5 | 38.555 | 59.705 | 92.880001 | 1274.949951 |
| Price_Change | 340.0 | NaN | NaN | NaN | 4.078194 | 12.006338 | -47.129693 | -0.939484 | 4.819505 | 10.695493 | 55.051683 |
| Volatility | 340.0 | NaN | NaN | NaN | 1.525976 | 0.591798 | 0.733163 | 1.134878 | 1.385593 | 1.695549 | 4.580042 |
| ROE | 340.0 | NaN | NaN | NaN | 39.597059 | 96.547538 | 1.0 | 9.75 | 15.0 | 27.0 | 917.0 |
| Cash_Ratio | 340.0 | NaN | NaN | NaN | 70.023529 | 90.421331 | 0.0 | 18.0 | 47.0 | 99.0 | 958.0 |
| Net_Cash_Flow | 340.0 | NaN | NaN | NaN | 55537620.588235 | 1946365312.175789 | -11208000000.0 | -193906500.0 | 2098000.0 | 169810750.0 | 20764000000.0 |
| Net_Income | 340.0 | NaN | NaN | NaN | 1494384602.941176 | 3940150279.327936 | -23528000000.0 | 352301250.0 | 707336000.0 | 1899000000.0 | 24442000000.0 |
| Earnings_Per_Share | 340.0 | NaN | NaN | NaN | 2.776662 | 6.587779 | -61.2 | 1.5575 | 2.895 | 4.62 | 50.09 |
| Estimated_Shares_Outstanding | 340.0 | NaN | NaN | NaN | 577028337.75403 | 845849595.417695 | 27672156.86 | 158848216.1 | 309675137.8 | 573117457.325 | 6159292035.0 |
| P/E_Ratio | 340.0 | NaN | NaN | NaN | 32.612563 | 44.348731 | 2.935451 | 15.044653 | 20.819876 | 31.764755 | 528.039074 |
| P/B_Ratio | 340.0 | NaN | NaN | NaN | -1.718249 | 13.966912 | -76.119077 | -4.352056 | -1.06717 | 3.917066 | 129.064585 |
Ticker_Symbol and Security are all unique values. Thus we will drop these 2 columns.#Look at all unique values for `GICS_Sector`
df['GICS_Sector'].unique()
array(['Industrials', 'Health Care', 'Information Technology',
'Consumer Staples', 'Utilities', 'Financials', 'Real Estate',
'Materials', 'Consumer Discretionary', 'Energy',
'Telecommunications Services'], dtype=object)
#Look at all unique values for `GICS_Sector`
df['GICS_Sub_Industry'].unique()
array(['Airlines', 'Pharmaceuticals', 'Health Care Equipment',
'Application Software', 'Semiconductors', 'Agricultural Products',
'Data Processing & Outsourced Services', 'MultiUtilities',
'Electric Utilities', 'Life & Health Insurance',
'Property & Casualty Insurance', 'REITs', 'Multi-line Insurance',
'Insurance Brokers', 'Internet Software & Services',
'Specialty Chemicals', 'Building Products', 'Biotechnology',
'Semiconductor Equipment', 'Electrical Components & Equipment',
'Asset Management & Custody Banks', 'Specialized REITs',
'Internet & Direct Marketing Retail', 'Specialty Stores',
'Managed Health Care', 'Oil & Gas Exploration & Production',
'Electronic Components', 'Aerospace & Defense',
'Home Entertainment Software', 'Residential REITs',
'Water Utilities', 'Consumer Finance', 'Banks',
'Oil & Gas Equipment & Services', 'Metal & Glass Containers',
'Health Care Distributors', 'Auto Parts & Equipment',
'Construction & Farm Machinery & Heavy Trucks',
'Real Estate Services', 'Hotels, Resorts & Cruise Lines',
'Fertilizers & Agricultural Chemicals', 'Regional Banks',
'Household Products', 'Integrated Oil & Gas',
'Air Freight & Logistics', 'Cable & Satellite',
'Financial Exchanges & Data', 'Restaurants',
'Industrial Machinery', 'Health Care Supplies', 'Railroads',
'Integrated Telecommunications Services',
'IT Consulting & Other Services', 'Drug Retail',
'Diversified Chemicals', 'Health Care Facilities',
'Industrial Conglomerates', 'Broadcasting & Cable TV',
'Research & Consulting Services', 'Soft Drinks',
'Investment Banking & Brokerage', 'Automobile Manufacturers',
'Copper', 'Electronic Equipment & Instruments',
'Diversified Commercial Services', 'Retail REITs',
'Consumer Electronics', 'Tires & Rubber', 'Industrial Materials',
'Leisure Products', 'Motorcycle Manufacturers',
'Technology Hardware, Storage & Peripherals', 'Computer Hardware',
'Packaged Foods & Meats', 'Paper Packaging', 'Advertising',
'Trucking', 'Networking Equipment',
'Oil & Gas Refining & Marketing & Transportation', 'Homebuilding',
'Distributors', 'Multi-Sector Holdings', 'Alternative Carriers',
'Diversified Financial Services', 'Home Furnishings',
'Construction Materials', 'Tobacco',
'Life Sciences Tools & Services', 'Gold', 'Steel',
'Housewares & Specialties', 'Thrifts & Mortgage Finance',
'Technology, Hardware, Software and Supplies', 'Personal Products',
'Industrial Gases', 'Human Resource & Employment Services',
'Office REITs', 'Brewers', 'Publishing', 'Specialty Retail',
'Apparel, Accessories & Luxury Goods', 'Household Appliances',
'Environmental Services', 'Casinos & Gaming'], dtype=object)
# check for missing values
df.isnull().sum()
Ticker_Symbol 0 Security 0 GICS_Sector 0 GICS_Sub_Industry 0 Current_Price 0 Price_Change 0 Volatility 0 ROE 0 Cash_Ratio 0 Net_Cash_Flow 0 Net_Income 0 Earnings_Per_Share 0 Estimated_Shares_Outstanding 0 P/E_Ratio 0 P/B_Ratio 0 dtype: int64
# check for duplicate rows
df.duplicated().sum()
0
def histogram_boxplot(data, feature, figsize=(15, 3), kde=False, bins=None):
"""
Boxplot and histogram combined
data: dataframe
feature: dataframe column
figsize: size of figure (default (15,10))
kde: whether to show the density curve (default False)
bins: number of bins for histogram (default None)
"""
f2, (ax_box2, ax_hist2) = plt.subplots(
nrows=2, # Number of rows of the subplot grid= 2
sharex=True, # x-axis will be shared among all subplots
gridspec_kw={"height_ratios": (0.25, 0.75)},
figsize=figsize,
) # creating the 2 subplots
sns.boxplot(
data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
) # boxplot will be created and a triangle will indicate the mean value of the column
sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins
) if bins else sns.histplot(
data=data, x=feature, kde=kde, ax=ax_hist2
) # For histogram
ax_hist2.axvline(
data[feature].mean(), color="green", linestyle="--"
) # Add mean to the histogram
ax_hist2.axvline(
data[feature].median(), color="black", linestyle="-"
) # Add median to the histogram
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
"""
Barplot with percentage at the top
data: dataframe
feature: dataframe column
perc: whether to display percentages instead of count (default is False)
n: displays the top n category levels (default is None, i.e., display all levels)
"""
total = len(data[feature]) # length of the column
count = data[feature].nunique()
if n is None:
plt.figure(figsize=(count + 2, 4))
else:
plt.figure(figsize=(n + 2, 4))
plt.xticks(rotation=90, fontsize=9)
ax = sns.countplot(
data=data,
x=feature,
palette="Paired",
order=data[feature].value_counts().index[:n],
)
for p in ax.patches:
if perc == True:
label = "{:.1f}%".format(
100 * p.get_height() / total
) # percentage of each class of the category
else:
label = p.get_height() # count of each level of the category
x = p.get_x() + p.get_width() / 2 # width of the plot
y = p.get_height() # height of the plot
ax.annotate(
label,
(x, y),
ha="center",
va="center",
size=12,
xytext=(0, 5),
textcoords="offset points",
) # annotate the percentage
plt.show() # show the plot
Q1. What does the distribution of stock prices look like?
# Question 1. What does the distribution of stock prices look like?
histogram_boxplot(df, 'Current_Price');
Observations:
data.shape
(340, 15)
# Get only numeric columns
num_cols = df.select_dtypes(include=np.number).columns.tolist()
for i in num_cols:
histogram_boxplot(df, i, kde=True)
Observations:
Price_Change, Net_Cash_Flow, Earnings_Per_Share, and P/B _Ratio have a normal distribution with outliers on both ends.Current_Price, Volatility, ROE, Cash_Ratio, Estimated_Shares_Outstanding and P/E Ratioare all right skewed with outliers on the upper end.df.head()
| Ticker_Symbol | Security | GICS_Sector | GICS_Sub_Industry | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 |
| 2 | ABT | Abbott Laboratories | Health Care | Health Care Equipment | 44.910000 | 11.301121 | 1.273646 | 21 | 67 | 938000000 | 4423000000 | 2.94 | 1.504422e+09 | 15.275510 | -0.394171 |
| 3 | ADBE | Adobe Systems Inc | Information Technology | Application Software | 93.940002 | 13.977195 | 1.357679 | 9 | 180 | -240840000 | 629551000 | 1.26 | 4.996437e+08 | 74.555557 | 4.199651 |
| 4 | ADI | Analog Devices, Inc. | Information Technology | Semiconductors | 55.320000 | -1.827858 | 1.701169 | 14 | 272 | 315120000 | 696878000 | 0.31 | 2.247994e+09 | 178.451613 | 1.059810 |
# graph cdf for each numeric variable
fig, axes = plt.subplots(6, 2, figsize=(20, 20))
fig.suptitle("CDF plot of numerical variables", fontsize=20)
counter = 0
for ii in range(6):
sns.ecdfplot(ax=axes[ii][0], x=df[num_cols[counter]])
counter = counter + 1
if counter != 11:
sns.ecdfplot(ax=axes[ii][1], x=df[num_cols[counter]])
counter = counter + 1
else:
pass
fig.tight_layout(pad=2.0)
Observations:
95% of stocks have a Current_Price less than $600
80% of Price_Change percentage is below ~17%
90% of Net_Income are less than $50,000,000,000
90% of P/E_Ratio’s are less than 50
# barplot on GICS Sectors
labeled_barplot(df, 'GICS_Sector', perc=True);
Observations:
# barplot on GICS Sub Industry
labeled_barplot(df, 'GICS_Sub_Industry', perc=True);
Observations:
sns.pairplot(df[num_cols], diag_kind='kde');
Observations:
Estimated_Shares_Outstanding and ROE
Volatility is revealing 4 clusters/groups.The stocks of which economic sector have seen the maximum price increase on average?
sns.catplot(data=df, x='GICS_Sector', y='Price_Change', kind='bar', height=5, aspect=1.5, ci=None)
plt.xticks(rotation=90);
Observations:
How are the different variables correlated with each other?
# Let's look for correlation
plt.figure(figsize=(12, 5))
sns.heatmap(df[num_cols].corr(), vmin=-1, vmax=1, annot=True,cmap='Spectral', cbar=True);
Observations:
Estimates Shares Outstanding and Net IncomeEarnings per share and Net IncomeCash ratio provides a measure of a company's ability to cover its short-term obligations using only cash and cash equivalents. How does the average cash ratio vary across economic sectors?
sns.catplot(data=df, x='GICS_Sector', y='Cash_Ratio', kind='bar', height=4, aspect=2, ci=None)
plt.xticks(rotation=90);
Observations:
P/E ratios can help determine the relative value of a company's shares as they signify the amount of money an investor is willing to invest in a single share of a company per dollar of its earnings. How does the P/E ratio vary, on average, across economic sectors?
sns.catplot(data=df, x='GICS_Sector', y='P/E_Ratio', kind='bar', height=4, aspect=2, ci=None)
plt.xticks(rotation=90);
Observations:
# plot boxplots of all numberica columns
plt.figure(figsize=(15, 10))
for i , variable in enumerate(num_cols):
plt.subplot(4, 4, i+1)
plt.boxplot(df[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
Observations:
# check for missing values
df.isnull().sum()
Ticker_Symbol 0 Security 0 GICS_Sector 0 GICS_Sub_Industry 0 Current_Price 0 Price_Change 0 Volatility 0 ROE 0 Cash_Ratio 0 Net_Cash_Flow 0 Net_Income 0 Earnings_Per_Share 0 Estimated_Shares_Outstanding 0 P/E_Ratio 0 P/B_Ratio 0 dtype: int64
# check for duplicate rows
df.duplicated().sum()
0
# Scale numeric columns so that we can compare columns of different labels. Using z-score.
scaler=StandardScaler()
subset=df[num_cols].copy()
subset_scaled=scaler.fit_transform(subset)
# creating a dataframe of the new scaled numeric columns
subset_scaled_df=pd.DataFrame(subset_scaled, columns=subset.columns)
# look at first 5 rows on new scaled dataframe
subset_scaled_df.head()
| Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.393341 | 0.493950 | 0.272749 | 0.989601 | -0.210698 | -0.339355 | 1.554415 | 1.309399 | 0.107863 | -0.652487 | -0.506653 |
| 1 | -0.220837 | 0.355439 | 1.137045 | 0.937737 | 0.077269 | -0.002335 | 0.927628 | 0.056755 | 1.250274 | -0.311769 | -0.504205 |
| 2 | -0.367195 | 0.602479 | -0.427007 | -0.192905 | -0.033488 | 0.454058 | 0.744371 | 0.024831 | 1.098021 | -0.391502 | 0.094941 |
| 3 | 0.133567 | 0.825696 | -0.284802 | -0.317379 | 1.218059 | -0.152497 | -0.219816 | -0.230563 | -0.091622 | 0.947148 | 0.424333 |
| 4 | -0.260874 | -0.492636 | 0.296470 | -0.265515 | 2.237018 | 0.133564 | -0.202703 | -0.374982 | 1.978399 | 3.293307 | 0.199196 |
# explore data after manipulation
plt.figure(figsize=(12, 5))
sns.heatmap(subset_scaled_df.corr(), vmin=-1, vmax=1, annot=True,cmap='Spectral', cbar=True);
# make a copy of the scaled dataset for k-means
kmeans_subset_scaled_df = subset_scaled_df.copy()
# Elbow method to find best k-value using Euclidean distance
# run for models of 1 cluster to 10 clusters
clusters = range(1, 15)
meanDistortions = []
# Using euclidean distance, find average distortion
for k in clusters:
model = KMeans(n_clusters=k, random_state=1)
model.fit(kmeans_subset_scaled_df)
prediction = model.predict(kmeans_subset_scaled_df)
distortion = (
sum(
np.min(cdist(kmeans_subset_scaled_df, model.cluster_centers_, "euclidean"), axis=1)
)
/ subset_scaled_df.shape[0]
)
meanDistortions.append(distortion)
print("Number of Clusters:", k, "\tAverage Distortion:", distortion)
# graph average distortions for the cluster options
plt.plot(clusters, meanDistortions, "bx-")
plt.xlabel("k")
plt.ylabel("Average Distortion")
plt.title("Selecting k-value via the Elbow Method (Euclidean distance)", fontsize=20)
Number of Clusters: 1 Average Distortion: 2.5425069919221697 Number of Clusters: 2 Average Distortion: 2.382318498894466 Number of Clusters: 3 Average Distortion: 2.269236715539074 Number of Clusters: 4 Average Distortion: 2.175554082632614 Number of Clusters: 5 Average Distortion: 2.136852169310594 Number of Clusters: 6 Average Distortion: 2.0713095032364506 Number of Clusters: 7 Average Distortion: 1.9775989987671811 Number of Clusters: 8 Average Distortion: 1.9652603012304684 Number of Clusters: 9 Average Distortion: 1.9311613101229936 Number of Clusters: 10 Average Distortion: 1.8502089919483544 Number of Clusters: 11 Average Distortion: 1.8248014803579016 Number of Clusters: 12 Average Distortion: 1.7525839358924045 Number of Clusters: 13 Average Distortion: 1.722563828836958 Number of Clusters: 14 Average Distortion: 1.6925733787700508
Text(0.5, 1.0, 'Selecting k-value via the Elbow Method (Euclidean distance)')
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(1, 15), timings=True)
visualizer.fit(kmeans_subset_scaled_df) # fit the data to the visualizer
visualizer.show(); # finalize and render figure
The appropriate value of k from the Euclidean elbow curve appears to be 4 or 7. As usual, we will not ulitze the suggest elbow of k=6.
# finding k-value, optimal no. of clusters, with silhouette coefficients
sil_score = []
cluster_list = list(range(2, 15))
for n_clusters in cluster_list:
clusterer = KMeans(n_clusters=n_clusters, random_state=1)
preds = clusterer.fit_predict((kmeans_subset_scaled_df))
# centers = clusterer.cluster_centers_
score = silhouette_score(kmeans_subset_scaled_df, preds)
sil_score.append(score)
print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
plt.plot(cluster_list, sil_score)
plt.title("Selecting k-value via Silhouette Method", fontsize=20)
For n_clusters = 2, silhouette score is 0.43969639509980457 For n_clusters = 3, silhouette score is 0.4644405674779404 For n_clusters = 4, silhouette score is 0.45434371948348606 For n_clusters = 5, silhouette score is 0.43169988466492354 For n_clusters = 6, silhouette score is 0.36698524210914063 For n_clusters = 7, silhouette score is 0.10587349598463779 For n_clusters = 8, silhouette score is 0.40021596082234684 For n_clusters = 9, silhouette score is 0.4051539505522535 For n_clusters = 10, silhouette score is 0.1147132918355368 For n_clusters = 11, silhouette score is 0.16268826704862685 For n_clusters = 12, silhouette score is 0.12883008320005324 For n_clusters = 13, silhouette score is 0.1744804713048242 For n_clusters = 14, silhouette score is 0.17457839075773943
Text(0.5, 1.0, 'Selecting k-value via Silhouette Method')
model=KMeans(random_state=1)
visualizer=KElbowVisualizer(model, k=(2,15), metric='silhouette', timings=True)
visualizer.fit(kmeans_subset_scaled_df)
visualizer.show();
We want the silhouette score closest to 1, while not having an excess number of clusters. Thus, we will procede with k=3 or 4
# visualize the optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(kmeans_subset_scaled_df)
visualizer.show();
# visualize the optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(kmeans_subset_scaled_df)
visualizer.show();
# visualize the optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(6, random_state=1))
visualizer.fit(kmeans_subset_scaled_df)
visualizer.show();
# visualize the optimal no. of clusters with silhouette coefficients
visualizer = SilhouetteVisualizer(KMeans(7, random_state=1))
visualizer.fit(kmeans_subset_scaled_df)
visualizer.show();
Elbow method: determined k=4 or 7 is best
Silhouette method: determine k=3 or 4 is best
Based on clients needs and context, we will procede with k=4 as the appropriate number of clusters because:
# fit model to scaled numeric dataset
kmeans = KMeans(n_clusters=4, random_state=0)
kmeans.fit(kmeans_subset_scaled_df)
KMeans(n_clusters=4, random_state=0)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
KMeans(n_clusters=4, random_state=0)
# adding kmeans cluster labels to the original dataframe & kmeans scaled df
kmeans_subset_scaled_df["K_means_cluster"] = kmeans.labels_
df["K_means_cluster"] = kmeans.labels_
# check for new column in dataset
df.head(2)
| Ticker_Symbol | Security | GICS_Sector | GICS_Sub_Industry | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 0 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 0 |
# get the numeric columns of the new dataframe
kmeans_num_cols=num_cols+["K_means_cluster"]
# group by k-value, then find the averages for each column
kmeans_cluster_profile = df[kmeans_num_cols].groupby("K_means_cluster").mean()
# add new column that tells how many items belong to each group/centroid
kmeans_cluster_profile["count_in_each_cluster"] = (
df.groupby("K_means_cluster")["GICS_Sector"].count().values
)
# let's display cluster profiles
kmeans_cluster_profile.style.highlight_max(color="mediumaquamarine", axis=0)
| Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | count_in_each_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_cluster | ||||||||||||
| 0 | 72.470050 | 5.059104 | 1.388717 | 34.710145 | 52.938406 | -18021028.985507 | 1471488521.739130 | 3.628424 | 429936522.010181 | 23.808921 | -3.355600 | 276 |
| 1 | 35.165385 | -16.390175 | 2.922214 | 110.961538 | 49.461538 | -192318884.615385 | -4041692307.692307 | -9.849231 | 482610056.416923 | 78.934814 | 1.612922 | 26 |
| 2 | 238.072932 | 13.508882 | 1.777479 | 25.600000 | 276.280000 | 752195440.000000 | 941997320.000000 | 5.889600 | 561429538.460800 | 90.217183 | 14.211965 | 25 |
| 3 | 48.103077 | 6.053507 | 1.163964 | 27.538462 | 77.230769 | 773230769.230769 | 14114923076.923077 | 3.958462 | 3918734987.169230 | 16.098039 | -4.253404 | 13 |
# check that 'K_means_cluster' column has been added
df.head(2)
| Ticker_Symbol | Security | GICS_Sector | GICS_Sub_Industry | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_cluster | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 0 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 0 |
# list companies in each hierarchical cluster
for cluster in df['K_means_cluster'].unique():
print(f"In Cluster {cluster}, the companies are:")
print(df[df['K_means_cluster'] == cluster]['Security'].unique())
print('-----')
In Cluster 0, the companies are: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Allegion' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'Charter Communications' 'CIGNA Corp.' 'Cincinnati Financial' 'Colgate-Palmolive' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'General Dynamics' 'General Growth Properties Inc.' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kimberly-Clark' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Spectra Energy Corp.' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'S&P Global, Inc.' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] ----- In Cluster 2, the companies are: ['Analog Devices, Inc.' 'Alliance Data Systems' 'Alexion Pharmaceuticals' 'Amgen Inc' 'Amazon.com Inc' 'BIOGEN IDEC Inc.' 'Celgene Corp.' 'Chipotle Mexican Grill' 'Equinix' 'Edwards Lifesciences' 'Facebook' 'First Solar Inc' 'Frontier Communications' 'Halliburton Co.' 'Intuitive Surgical Inc.' "McDonald's Corp." 'Monster Beverage' 'Netflix Inc.' 'Priceline.com Inc' 'Regeneron' 'TripAdvisor' 'Vertex Pharmaceuticals Inc' 'Waters Corporation' 'Wynn Resorts Ltd' 'Yahoo Inc.'] ----- In Cluster 1, the companies are: ['Apache Corporation' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Chesapeake Energy' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'EQT Corporation' 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Hewlett Packard Enterprise' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Quanta Services Inc.' 'Range Resources Corp.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] ----- In Cluster 3, the companies are: ['Bank of America Corp' 'Citigroup Inc.' 'Ford Motor' 'Gilead Sciences' 'Intel Corp.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Merck & Co.' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.'] -----
# Listing the sectors in keach cluster
df.groupby(['K_means_cluster', 'GICS_Sector'])['Security'].count()
K_means_cluster GICS_Sector
0 Consumer Discretionary 33
Consumer Staples 17
Energy 6
Financials 45
Health Care 28
Industrials 52
Information Technology 24
Materials 19
Real Estate 26
Telecommunications Services 2
Utilities 24
1 Energy 22
Industrials 1
Information Technology 2
Materials 1
2 Consumer Discretionary 6
Consumer Staples 1
Energy 1
Health Care 9
Information Technology 6
Real Estate 1
Telecommunications Services 1
3 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 4
Health Care 3
Information Technology 1
Telecommunications Services 2
Name: Security, dtype: int64
plt.figure(figsize=(15, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(kmeans_num_cols):
plt.subplot(4, 3, i + 1)
sns.boxplot(data=df, x="K_means_cluster", y=variable)
plt.tight_layout(pad=2.0)
#compare cluster vs categorical column(`GICS_Sector`)
pd.crosstab(df.K_means_cluster, df.GICS_Sector).style.highlight_max(
color="mediumaquamarine", axis=0
)
| GICS_Sector | Consumer Discretionary | Consumer Staples | Energy | Financials | Health Care | Industrials | Information Technology | Materials | Real Estate | Telecommunications Services | Utilities |
|---|---|---|---|---|---|---|---|---|---|---|---|
| K_means_cluster | |||||||||||
| 0 | 33 | 17 | 6 | 45 | 28 | 52 | 24 | 19 | 26 | 2 | 24 |
| 1 | 0 | 0 | 22 | 0 | 0 | 1 | 2 | 1 | 0 | 0 | 0 |
| 2 | 6 | 1 | 1 | 0 | 9 | 0 | 6 | 0 | 1 | 1 | 0 |
| 3 | 1 | 1 | 1 | 4 | 3 | 0 | 1 | 0 | 0 | 2 | 0 |
# Graph mean values for each k-means cluster
df[kmeans_num_cols].groupby('K_means_cluster').mean().plot.bar(figsize=(12, 6), colormap='Paired');
Cluster 2:
Cluster 3:
# make a copy of the scaled dataset for k-means
hier_subset_scaled_df = subset_scaled_df.copy()
# list of distance metrics
distance_metrics = ["euclidean", "chebyshev", "mahalanobis", "cityblock"]
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for dm in distance_metrics:
for lm in linkage_methods:
Z = linkage(hier_subset_scaled_df, metric=dm, method=lm)
c, coph_dists = cophenet(Z, pdist(hier_subset_scaled_df))
print(
"Cophenetic correlation for {} distance and {} linkage is {}.".format(
dm.capitalize(), lm, c
)
)
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = dm
high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance and single linkage is 0.9232271494002922. Cophenetic correlation for Euclidean distance and complete linkage is 0.7873280186580672. Cophenetic correlation for Euclidean distance and average linkage is 0.9422540609560814. Cophenetic correlation for Euclidean distance and weighted linkage is 0.8693784298129404. Cophenetic correlation for Chebyshev distance and single linkage is 0.9062538164750717. Cophenetic correlation for Chebyshev distance and complete linkage is 0.598891419111242. Cophenetic correlation for Chebyshev distance and average linkage is 0.9338265528030499. Cophenetic correlation for Chebyshev distance and weighted linkage is 0.9127355892367. Cophenetic correlation for Mahalanobis distance and single linkage is 0.925919553052459. Cophenetic correlation for Mahalanobis distance and complete linkage is 0.7925307202850002. Cophenetic correlation for Mahalanobis distance and average linkage is 0.9247324030159736. Cophenetic correlation for Mahalanobis distance and weighted linkage is 0.8708317490180428. Cophenetic correlation for Cityblock distance and single linkage is 0.9334186366528574. Cophenetic correlation for Cityblock distance and complete linkage is 0.7375328863205818. Cophenetic correlation for Cityblock distance and average linkage is 0.9302145048594667. Cophenetic correlation for Cityblock distance and weighted linkage is 0.731045513520281.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9422540609560814, which is obtained with Euclidean distance and average linkage.
Note: The best distance metric is Euclidean.Thus we will explore 2 other linkage methods that only work with Euclidean: ward and centroid linkage
Let's explore different linkage methods with Euclidean distance only.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
high_cophenet_corr = 0
high_dm_lm = [0, 0]
for lm in linkage_methods:
Z = linkage(hier_subset_scaled_df, metric="euclidean", method=lm)
c, coph_dists = cophenet(Z, pdist(hier_subset_scaled_df))
print("Cophenetic correlation for {} linkage is {}.".format(lm, c))
if high_cophenet_corr < c:
high_cophenet_corr = c
high_dm_lm[0] = "euclidean"
high_dm_lm[1] = lm
Cophenetic correlation for single linkage is 0.9232271494002922. Cophenetic correlation for complete linkage is 0.7873280186580672. Cophenetic correlation for average linkage is 0.9422540609560814. Cophenetic correlation for centroid linkage is 0.9314012446828154. Cophenetic correlation for ward linkage is 0.7101180299865353. Cophenetic correlation for weighted linkage is 0.8693784298129404.
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
"Highest cophenetic correlation is {}, which is obtained with {} linkage.".format(
high_cophenet_corr, high_dm_lm[1]
)
)
Highest cophenetic correlation is 0.9422540609560814, which is obtained with average linkage.
Observations:
Let's see the dendrograms for Euclidean distance with the different linkage methods.
# list of linkage methods
linkage_methods = ["single", "complete", "average", "centroid", "ward", "weighted"]
# lists to save results of cophenetic correlation calculation
compare_cols = ["Linkage", "Cophenetic Coefficient"]
# to create a subplot image
fig, axs = plt.subplots(len(linkage_methods), 1, figsize=(15, 30))
# We will enumerate through the list of linkage methods above
# For each linkage method, we will plot the dendrogram and calculate the cophenetic correlation
for i, method in enumerate(linkage_methods):
Z = linkage(hier_subset_scaled_df, metric="euclidean", method=method)
dendrogram(Z, ax=axs[i])
axs[i].set_title(f"Dendrogram ({method.capitalize()} Linkage)")
coph_corr, coph_dist = cophenet(Z, pdist(hier_subset_scaled_df))
axs[i].annotate(
f"Cophenetic\nCorrelation\n{coph_corr:0.2f}",
(0.80, 0.80),
xycoords="axes fraction",
)
Observations
# create model with cluster, distance, and linkage from above
HCmodel = AgglomerativeClustering(n_clusters=4, affinity='euclidean', linkage='ward')
HCmodel.fit(hier_subset_scaled_df)
AgglomerativeClustering(affinity='euclidean', n_clusters=4)In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
AgglomerativeClustering(affinity='euclidean', n_clusters=4)
# add cluster number column to original and hierarchical scaled df
hier_subset_scaled_df["HC_Clusters"] = HCmodel.labels_
df["HC_Clusters"] = HCmodel.labels_
# look at new columns in the df
df.head(2)
| Ticker_Symbol | Security | GICS_Sector | GICS_Sub_Industry | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_cluster | HC_Clusters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 0 | 3 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 0 | 3 |
# get numeric columns of dataset after adding cluster column
hc_num_col = df.select_dtypes(include=np.number).columns.tolist()
# averages for each cluster by column
hc_cluster_profile = df[hc_num_col].groupby("HC_Clusters").mean()
# count how many values exist in each cluster
hc_cluster_profile["count_in_each_segments"] = (
df.groupby("HC_Clusters")["GICS_Sector"].count().values
)
# let's display cluster profiles
hc_cluster_profile.style.highlight_max(color="mediumaquamarine", axis=0)
| Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_cluster | count_in_each_segments | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | |||||||||||||
| 0 | 48.006208 | -11.263107 | 2.590247 | 196.551724 | 40.275862 | -495901724.137931 | -3597244655.172414 | -8.689655 | 486319827.294483 | 75.110924 | -2.162622 | 0.793103 | 29 |
| 1 | 326.198218 | 10.563242 | 1.642560 | 14.400000 | 309.466667 | 288850666.666667 | 864498533.333333 | 7.785333 | 544900261.301333 | 113.095334 | 19.142151 | 2.000000 | 15 |
| 2 | 42.848182 | 6.270446 | 1.123547 | 22.727273 | 71.454545 | 558636363.636364 | 14631272727.272728 | 3.410000 | 4242572567.290909 | 15.242169 | -4.924615 | 3.000000 | 11 |
| 3 | 72.760400 | 5.213307 | 1.427078 | 25.603509 | 60.392982 | 79951512.280702 | 1538594322.807018 | 3.655351 | 446472132.228456 | 24.722670 | -2.647194 | 0.101754 | 285 |
df.head(2)
| Ticker_Symbol | Security | GICS_Sector | GICS_Sub_Industry | Current_Price | Price_Change | Volatility | ROE | Cash_Ratio | Net_Cash_Flow | Net_Income | Earnings_Per_Share | Estimated_Shares_Outstanding | P/E_Ratio | P/B_Ratio | K_means_cluster | HC_Clusters | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AAL | American Airlines Group | Industrials | Airlines | 42.349998 | 9.999995 | 1.687151 | 135 | 51 | -604000000 | 7610000000 | 11.39 | 6.681299e+08 | 3.718174 | -8.784219 | 0 | 3 |
| 1 | ABBV | AbbVie | Health Care | Pharmaceuticals | 59.240002 | 8.339433 | 2.197887 | 130 | 77 | 51000000 | 5144000000 | 3.15 | 1.633016e+09 | 18.806350 | -8.750068 | 0 | 3 |
# list companies in each hierarchical cluster
for cluster in df['HC_Clusters'].unique():
print(f"In Cluster {cluster}, the companies are:")
print(df[df['HC_Clusters'] == cluster]['Security'].unique())
print('-----')
In Cluster 3, the companies are: ['American Airlines Group' 'AbbVie' 'Abbott Laboratories' 'Adobe Systems Inc' 'Analog Devices, Inc.' 'Archer-Daniels-Midland Co' 'Ameren Corp' 'American Electric Power' 'AFLAC Inc' 'American International Group, Inc.' 'Apartment Investment & Mgmt' 'Assurant Inc' 'Arthur J. Gallagher & Co.' 'Akamai Technologies Inc' 'Albemarle Corp' 'Alaska Air Group Inc' 'Allstate Corp' 'Applied Materials Inc' 'AMETEK Inc' 'Affiliated Managers Group Inc' 'Ameriprise Financial' 'American Tower Corp A' 'AutoNation Inc' 'Anthem Inc.' 'Aon plc' 'Amphenol Corp' 'Arconic Inc' 'Activision Blizzard' 'AvalonBay Communities, Inc.' 'Broadcom' 'American Water Works Company Inc' 'American Express Co' 'Boeing Company' 'Baxter International Inc.' 'BB&T Corporation' 'Bard (C.R.) Inc.' 'BIOGEN IDEC Inc.' 'The Bank of New York Mellon Corp.' 'Ball Corp' 'Bristol-Myers Squibb' 'Boston Scientific' 'BorgWarner' 'Boston Properties' 'Caterpillar Inc.' 'Chubb Limited' 'CBRE Group' 'Crown Castle International Corp.' 'Carnival Corp.' 'Celgene Corp.' 'CF Industries Holdings Inc' 'Citizens Financial Group' 'Church & Dwight' 'C. H. Robinson Worldwide' 'CIGNA Corp.' 'Cincinnati Financial' 'Comerica Inc.' 'CME Group Inc.' 'Cummins Inc.' 'CMS Energy' 'Centene Corporation' 'CenterPoint Energy' 'Capital One Financial' 'The Cooper Companies' 'CSX Corp.' 'CenturyLink Inc' 'Cognizant Technology Solutions' 'Citrix Systems' 'CVS Health' 'Chevron Corp.' 'Dominion Resources' 'Delta Air Lines' 'Du Pont (E.I.)' 'Deere & Co.' 'Discover Financial Services' 'Quest Diagnostics' 'Danaher Corp.' 'The Walt Disney Company' 'Discovery Communications-A' 'Discovery Communications-C' 'Delphi Automotive' 'Digital Realty Trust' 'Dun & Bradstreet' 'Dover Corp.' 'Dr Pepper Snapple Group' 'Duke Energy' 'DaVita Inc.' 'eBay Inc.' 'Ecolab Inc.' 'Consolidated Edison' 'Equifax Inc.' "Edison Int'l" 'Eastman Chemical' 'Equity Residential' 'EQT Corporation' 'Eversource Energy' 'Essex Property Trust, Inc.' 'E*Trade' 'Eaton Corporation' 'Entergy Corp.' 'Edwards Lifesciences' 'Exelon Corp.' "Expeditors Int'l" 'Expedia Inc.' 'Extra Space Storage' 'Fastenal Co' 'Fortune Brands Home & Security' 'FirstEnergy Corp' 'Fidelity National Information Services' 'Fiserv Inc' 'FLIR Systems' 'Fluor Corp.' 'Flowserve Corporation' 'FMC Corporation' 'Federal Realty Investment Trust' 'First Solar Inc' 'General Dynamics' 'General Growth Properties Inc.' 'Gilead Sciences' 'Corning Inc.' 'General Motors' 'Genuine Parts' 'Garmin Ltd.' 'Goodyear Tire & Rubber' 'Grainger (W.W.) Inc.' 'Halliburton Co.' 'Hasbro Inc.' 'Huntington Bancshares' 'HCA Holdings' 'Welltower Inc.' 'HCP Inc.' 'Hartford Financial Svc.Gp.' 'Harley-Davidson' "Honeywell Int'l Inc." 'Hewlett Packard Enterprise' 'HP Inc.' 'Hormel Foods Corp.' 'Henry Schein' 'Host Hotels & Resorts' 'The Hershey Company' 'Humana Inc.' 'International Business Machines' 'IDEXX Laboratories' 'Intl Flavors & Fragrances' 'International Paper' 'Interpublic Group' 'Iron Mountain Incorporated' 'Illinois Tool Works' 'Invesco Ltd.' 'J. B. Hunt Transport Services' 'Jacobs Engineering Group' 'Juniper Networks' 'Kimco Realty' 'Kansas City Southern' 'Leggett & Platt' 'Lennar Corp.' 'Laboratory Corp. of America Holding' 'LKQ Corporation' 'L-3 Communications Holdings' 'Lilly (Eli) & Co.' 'Lockheed Martin Corp.' 'Alliant Energy Corp' 'Leucadia National Corp.' 'Southwest Airlines' 'Level 3 Communications' 'LyondellBasell' 'Mastercard Inc.' 'Mid-America Apartments' 'Macerich' "Marriott Int'l." 'Masco Corp.' 'Mattel Inc.' "McDonald's Corp." "Moody's Corp" 'Mondelez International' 'MetLife Inc.' 'Mohawk Industries' 'Mead Johnson' 'McCormick & Co.' 'Martin Marietta Materials' 'Marsh & McLennan' '3M Company' 'Altria Group Inc' 'The Mosaic Company' 'Marathon Petroleum' 'Merck & Co.' 'M&T Bank Corp.' 'Mettler Toledo' 'Mylan N.V.' 'Navient' 'NASDAQ OMX Group' 'NextEra Energy' 'Newmont Mining Corp. (Hldg. Co.)' 'Nielsen Holdings' 'Norfolk Southern Corp.' 'Northern Trust Corp.' 'Nucor Corp.' 'Newell Brands' 'Realty Income Corporation' 'Omnicom Group' "O'Reilly Automotive" "People's United Financial" 'Pitney-Bowes' 'PACCAR Inc.' 'PG&E Corp.' 'Public Serv. Enterprise Inc.' 'PepsiCo Inc.' 'Principal Financial Group' 'Procter & Gamble' 'Progressive Corp.' 'Pulte Homes Inc.' 'Philip Morris International' 'PNC Financial Services' 'Pentair Ltd.' 'Pinnacle West Capital' 'PPG Industries' 'PPL Corp.' 'Prudential Financial' 'Phillips 66' 'Quanta Services Inc.' 'Praxair Inc.' 'PayPal' 'Ryder System' 'Royal Caribbean Cruises Ltd' 'Robert Half International' 'Roper Industries' 'Republic Services Inc' 'SCANA Corp' 'Charles Schwab Corporation' 'Sealed Air' 'Sherwin-Williams' 'SL Green Realty' 'Scripps Networks Interactive Inc.' 'Southern Co.' 'Simon Property Group Inc' 'Stericycle Inc' 'Sempra Energy' 'SunTrust Banks' 'State Street Corp.' 'Skyworks Solutions' 'Synchrony Financial' 'Stryker Corp.' 'Molson Coors Brewing Company' 'Tegna, Inc.' 'Torchmark Corp.' 'Thermo Fisher Scientific' 'TripAdvisor' 'The Travelers Companies Inc.' 'Tractor Supply Company' 'Tyson Foods' 'Tesoro Petroleum Co.' 'Total System Services' 'Texas Instruments' 'Under Armour' 'United Continental Holdings' 'UDR Inc' 'Universal Health Services, Inc.' 'United Health Group Inc.' 'Unum Group' 'Union Pacific' 'United Parcel Service' 'United Technologies' 'Varian Medical Systems' 'Valero Energy' 'Vulcan Materials' 'Vornado Realty Trust' 'Verisk Analytics' 'Verisign Inc.' 'Vertex Pharmaceuticals Inc' 'Ventas Inc' 'Wec Energy Group Inc' 'Whirlpool Corp.' 'Waste Management Inc.' 'Western Union Co' 'Weyerhaeuser Corp.' 'Wyndham Worldwide' 'Wynn Resorts Ltd' 'Xcel Energy Inc' 'XL Capital' 'Dentsply Sirona' 'Xerox Corp.' 'Xylem Inc.' 'Yum! Brands Inc' 'Zimmer Biomet Holdings' 'Zions Bancorp' 'Zoetis'] ----- In Cluster 1, the companies are: ['Alliance Data Systems' 'Alexion Pharmaceuticals' 'Amgen Inc' 'Amazon.com Inc' 'Chipotle Mexican Grill' 'Equinix' 'Facebook' 'Frontier Communications' 'Intuitive Surgical Inc.' 'Monster Beverage' 'Netflix Inc.' 'Priceline.com Inc' 'Regeneron' 'Waters Corporation' 'Yahoo Inc.'] ----- In Cluster 0, the companies are: ['Allegion' 'Apache Corporation' 'Anadarko Petroleum Corp' 'Baker Hughes Inc' 'Chesapeake Energy' 'Charter Communications' 'Colgate-Palmolive' 'Cabot Oil & Gas' 'Concho Resources' 'Devon Energy Corp.' 'EOG Resources' 'Freeport-McMoran Cp & Gld' 'Hess Corporation' 'Kimberly-Clark' 'Kinder Morgan' 'Marathon Oil Corp.' 'Murphy Oil' 'Noble Energy Inc' 'Newfield Exploration Co' 'National Oilwell Varco Inc.' 'ONEOK' 'Occidental Petroleum' 'Range Resources Corp.' 'Spectra Energy Corp.' 'S&P Global, Inc.' 'Southwestern Energy' 'Teradata Corp.' 'Williams Cos.' 'Cimarex Energy'] ----- In Cluster 2, the companies are: ['Bank of America Corp' 'Citigroup Inc.' 'Ford Motor' 'Intel Corp.' 'JPMorgan Chase & Co.' 'Coca Cola Company' 'Pfizer Inc.' 'AT&T Inc' 'Verizon Communications' 'Wells Fargo' 'Exxon Mobil Corp.'] -----
# Listing the sectors in each cluster
df.groupby(['HC_Clusters', 'GICS_Sector'])['Security'].count()
HC_Clusters GICS_Sector
0 Consumer Discretionary 1
Consumer Staples 2
Energy 22
Financials 1
Industrials 1
Information Technology 1
Materials 1
1 Consumer Discretionary 3
Consumer Staples 1
Health Care 5
Information Technology 4
Real Estate 1
Telecommunications Services 1
2 Consumer Discretionary 1
Consumer Staples 1
Energy 1
Financials 4
Health Care 1
Information Technology 1
Telecommunications Services 2
3 Consumer Discretionary 35
Consumer Staples 15
Energy 7
Financials 44
Health Care 34
Industrials 52
Information Technology 27
Materials 19
Real Estate 26
Telecommunications Services 2
Utilities 24
Name: Security, dtype: int64
# look at sectors in each hierarchical cluster
df.groupby(['HC_Clusters', 'GICS_Sector'])['Security'].apply(list)
HC_Clusters GICS_Sector
0 Consumer Discretionary [Charter Communications]
Consumer Staples [Colgate-Palmolive, Kimberly-Clark]
Energy [Apache Corporation, Anadarko Petroleum Corp, ...
Financials [S&P Global, Inc.]
Industrials [Allegion]
Information Technology [Teradata Corp.]
Materials [Freeport-McMoran Cp & Gld]
1 Consumer Discretionary [Amazon.com Inc, Chipotle Mexican Grill, Price...
Consumer Staples [Monster Beverage]
Health Care [Alexion Pharmaceuticals, Amgen Inc, Intuitive...
Information Technology [Alliance Data Systems, Facebook, Netflix Inc....
Real Estate [Equinix]
Telecommunications Services [Frontier Communications]
2 Consumer Discretionary [Ford Motor]
Consumer Staples [Coca Cola Company]
Energy [Exxon Mobil Corp.]
Financials [Bank of America Corp, Citigroup Inc., JPMorga...
Health Care [Pfizer Inc.]
Information Technology [Intel Corp.]
Telecommunications Services [AT&T Inc, Verizon Communications]
3 Consumer Discretionary [AutoNation Inc, BorgWarner, Carnival Corp., T...
Consumer Staples [Archer-Daniels-Midland Co, Church & Dwight, C...
Energy [Chevron Corp., EQT Corporation, Halliburton C...
Financials [AFLAC Inc, American International Group, Inc....
Health Care [AbbVie, Abbott Laboratories, Anthem Inc., Bax...
Industrials [American Airlines Group, Alaska Air Group Inc...
Information Technology [Adobe Systems Inc, Analog Devices, Inc., Akam...
Materials [Albemarle Corp, Ball Corp, CF Industries Hold...
Real Estate [Apartment Investment & Mgmt, American Tower C...
Telecommunications Services [CenturyLink Inc, Level 3 Communications]
Utilities [Ameren Corp, American Electric Power, America...
Name: Security, dtype: object
plt.figure(figsize=(15, 20))
plt.suptitle("Boxplot of numerical variables for each cluster")
for i, variable in enumerate(hc_num_col):
plt.subplot(5, 3, i + 1)
sns.boxplot(data=df, x="HC_Clusters", y=variable)
plt.tight_layout(pad=2.0)
Let's compare Cluster to GICS Sector
#compare cluster vs categorical column(`GICS_Sector`)
pd.crosstab(df.HC_Clusters, df.GICS_Sector).style.highlight_max(color="mediumaquamarine", axis=0)
| GICS_Sector | Consumer Discretionary | Consumer Staples | Energy | Financials | Health Care | Industrials | Information Technology | Materials | Real Estate | Telecommunications Services | Utilities |
|---|---|---|---|---|---|---|---|---|---|---|---|
| HC_Clusters | |||||||||||
| 0 | 1 | 2 | 22 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 |
| 1 | 3 | 1 | 0 | 0 | 5 | 0 | 4 | 0 | 1 | 1 | 0 |
| 2 | 1 | 1 | 1 | 4 | 1 | 0 | 1 | 0 | 0 | 2 | 0 |
| 3 | 35 | 15 | 7 | 44 | 34 | 52 | 27 | 19 | 26 | 2 | 24 |
# Graph mean values for each hierarchical cluster
df[hc_num_col].groupby('HC_Clusters').mean().plot.bar(figsize=(12, 6), colormap='Paired');
Cluster 0:
Cluster 1:
Cluster 2:
Cluster 3:
Observations *
Observations
# Total rows in each k-means cluster
kmeans_cluster_profile["count_in_each_cluster"]
K_means_cluster 0 276 1 26 2 25 3 13 Name: count_in_each_cluster, dtype: int64
# Total rows in each hierarchical cluster
hc_cluster_profile["count_in_each_segments"]
HC_Clusters 0 29 1 15 2 11 3 285 Name: count_in_each_segments, dtype: int64
# How many rows belong to a cluster in K-means and hierarchical
count = ((df['K_means_cluster'] == 0) & (df['HC_Clusters'] == 3)).sum()
print('Number of rows are shared between K-Means cluster 0 and Hierarchical cluster 3:', count)
count = ((df['K_means_cluster'] == 1) & (df['HC_Clusters'] == 1)).sum()
print('Number of rows are shared between K-Means cluster 1 and Hierarchical cluster 1:', count)
count = ((df['K_means_cluster'] == 2) & (df['HC_Clusters'] == 0)).sum()
print('Number of rows are shared between K-Means cluster 2 and Hierarchical cluster 0:', count)
count = ((df['K_means_cluster'] == 3) & (df['HC_Clusters'] == 2)).sum()
print('Number of rows are shared between K-Means cluster 3 and Hierarchical cluster 2:', count)
Number of rows are shared between K-Means cluster 0 and Hierarchical cluster 3: 270 Number of rows are shared between K-Means cluster 1 and Hierarchical cluster 1: 0 Number of rows are shared between K-Means cluster 2 and Hierarchical cluster 0: 0 Number of rows are shared between K-Means cluster 3 and Hierarchical cluster 2: 11
Observation:
Important notes:
The K-means clustering approach is more computationally efficient than hierachical clustering. Thus the simpler model that is cheaper to utilize.
If simplicity is not required, then Hierarchical clustering is giving a better model.
Both clustering methods arrived at the same number of clusters, 4.
Both clustering methods arrived at similar distinct clusters, neither is more compact..
It would be advised to move forward with the k-means cluster final model as it is cost effective, efficient, and gives similar results to the more complex hierarchical cluster model.
Push these stocks clusters out to advisors and investors so that they can align their investment goals to the 4 cluster classes revealed by our modeling.
Do further analysis on each cluster and GICS Sector to get more detailed findings.
Hierarchical clustering takea-aways:
Cluster 0 Most of these companies are the the energy sector.
Cluster 1 Most of these companies are in the finacial and telecommunications sector.
Cluster 2 Most of these companies are in the healthcare and technology sector.
Cluster 3 Most of these companies are very diversified across the sectors, except energy.
Let's use Principal Component Analysis (PCA) to visualize the clusters. Unfortunately only can visualize 2-D or 3-D clusters (thus reducing only to 2 or 3 columns). Not good for any anaysis.
Unfortunately we will lose some feature information and only arrive at the 2 most principal components.
from sklearn.decomposition import PCA
# setting the number of compnents to 2 (2-D)
pca=PCA(n_components=2)
#transform data and storing results in a dataframe
X_reduced_pca=pca.fit_transform(subset_scaled)
reduced_df_pca=pd.DataFrame(X_reduced_pca,columns=['Component 1','Component 2'])
# checking the amount of variance explained
pca.explained_variance_ratio_.sum()
0.3712899567725855
# visualize principal component scatterplot & target variable
sns.scatterplot(data=reduced_df_pca,x='Component 1',y='Component 2');
# visualize principal component scatterplot & target variable, separated by hierarchical clusters
sns.scatterplot(data=reduced_df_pca,x='Component 1',y='Component 2', hue=df['HC_Clusters'], palette='rainbow');
# visualize principal component scatterplot & target variable, separated by hierarchical clusters
sns.scatterplot(data=reduced_df_pca,x='Component 1',y='Component 2', hue=df['K_means_cluster'], palette='rainbow');